1 Shared memory computing on clusters with symmetric multiprocessors and system
area networks

Leonidas Kontothanassis, Robert Stets, Galen Hunt, Umit Rencuzogullari, Gautam Altekar, 
Sandhya Dwarkadas, Michael L. Scott 

August 2005 ACM Transactions on Computer Systems (TOCS), volume 23 issue 3 
Publisher: ACM Press 

Full text available: pdf (918.28 KB) Additional Information: full citation, abstract, references, index terms

Cashmere is a software distributed shared memory (S-DSM) system designed for clusters 
of server-class machines. It is distinguished from most other S-DSM projects by (1) the 
effective use of fast user-level messaging, as provided by modern system-area networks, 
and (2) a "two-level" protocol structure that exploits hardware coherence within
multiprocessor nodes. Fast user-level messages change the tradeoffs in coherence 
protocol design; they allow Cashmere to employ a relatively simp ... 

Keywords: Distributed shared memory, relaxed consistency, software coherence 



2 The directory-based cache coherence protocol for the DASH multiprocessor
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, John Hennessy 
May 1990 ACM SIGARCH Computer Architecture News, Proceedings of the 17th
annual international symposium on Computer Architecture ISCA '90, volume
18 Issue 3a
Publisher: ACM Press

Full text available: pdf (1.74 MB) Additional Information: full citation, abstract, references, citings, index terms



DASH is a scalable shared-memory multiprocessor currently being developed at Stanford's 
Computer Systems Laboratory. The architecture consists of powerful processing nodes, 
each with a portion of the shared-memory, connected to a scalable interconnection 
network. A key feature of DASH is its distributed directory-based cache coherence 
protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely 
on broadcast; instead it uses point-to-point messages sent between th ... 

3 Cashmere-2L: software coherent shared memory on a clustered remote-write
network

Robert Stets, Sandhya Dwarkadas, Nikolaos Hardavellas, Galen Hunt, Leonidas 
Kontothanassis, Srinivasan Parthasarathy, Michael Scott 
October 1997 ACM SIGOPS Operating Systems Review, Proceedings of the sixteenth
ACM symposium on Operating systems principles SOSP '97, volume 31 issue 5
Publisher: ACM Press

Full text available: pdf (2.17 MB) Additional Information: full citation, references, citings, index terms



4 Code optimization - 1: Local scheduling techniques for memory coherence in a
clustered VLIW processor with a distributed data cache 
Enric Gibert, Jesus Sanchez, Antonio Gonzalez 

March 2003 Proceedings of the international symposium on Code generation and 
optimization: feedback-directed and runtime optimization CGO '03 

Publisher: IEEE Computer Society 

Full text available: pdf (1.19 MB) Additional Information: full citation, abstract, references, index terms

Clustering is a common technique to deal with wire delays. Fully-distributed architectures, 
where the register file, the functional units and the cache memory are partitioned, are 
particularly effective to deal with these constraints and besides they are very scalable.
However, the distribution of the data cache introduces a new problem: memory 
instructions may reach the cache in an order different to the sequential program order, 
thus possibly violating its contents. In this paper two local sch ... 




5 The effects of communication parameters on end performance of shared virtual
memory clusters
Angelos Bilas, Jaswinder Pal Singh

November 1997 Proceedings of the 1997 ACM/IEEE conference on Supercomputing 
(CDROM) 

Publisher: ACM Press 

Full text available: pdf (201.86 KB) Additional Information: full citation, abstract, references, citings

Recently there has been a lot of effort in providing cost-effective Shared Memory systems 
by employing software only solutions on clusters of high-end workstations coupled with 
high-bandwidth, low-latency commodity networks. Much of the work so far has focused on 
improving protocols, and there has been some work on restructuring applications to 
perform better on SVM systems. The result of this progress has been the promise for good 
performance on a range of applications at least in the 16-32 pro ... 

Keywords: bandwidth, clustering, communication parameters, distributed memory, host 
overhead, interrupt cost, latency, network occupancy, shared memory 




6 Communication and consistency protocols: Detailed cache coherence
characterization for OpenMP benchmarks
Jaydeep Marathe, Anita Nagarajan, Frank Mueller

June 2004 Proceedings of the 18th annual international conference on 
Supercomputing 

Publisher: ACM Press 

Full text available: pdf (358.00 KB) Additional Information: full citation, abstract, references, index terms

Past work on studying cache coherence in shared-memory symmetric multiprocessors 
(SMPs) concentrates on studying aggregate events, often from an architecture point of 
view. However, this approach provides insufficient information about the exact sources of 
inefficiencies in parallel applications. For SMPs in contemporary clusters, application 
performance is impacted by the pattern of shared memory usage, and it becomes 
essential to understand coherence behavior in terms of the application progra ... 


7 The performance of cache-coherent ring-based multiprocessors
Luis Andre Barroso, Michel Dubois 

May 1993 ACM SIGARCH Computer Architecture News, Proceedings of the 20th
annual international symposium on Computer architecture ISCA '93, volume
21 Issue 2
Publisher: ACM Press

Full text available: pdf (1.03 MB) Additional Information: full citation, abstract, references, citings, index terms

Advances in circuit and integration technology are continuously boosting the speed of 
microprocessors. One of the main challenges presented by such developments is the 
effective use of powerful microprocessors in shared memory multiprocessor configurations. 
We believe that the interconnection problem is not solved even for small scale shared 
memory multiprocessors, since the speed of shared buses is unlikely to keep up with the 
bandwidth requirements of new microprocessors. In this paper we ... 

8 Parallel simulation of chip-multiprocessor architectures
Matthew Chidester, Alan George 

July 2002 ACM Transactions on Modeling and Computer Simulation (TOMACS), volume
12 Issue 3
Publisher: ACM Press

Full text available: pdf (519.20 KB) Additional Information: full citation, abstract, references, index terms

Chip-multiprocessor (CMP) architectures present a challenge for efficient simulation, 
combining the requirements of a detailed microprocessor simulator with that of a tightly- 
coupled parallel system. In this paper, a distributed simulator for target CMPs is presented 
based on the Message Passing Interface (MPI) designed to run on a host cluster of 
workstations. Microbenchmark-based evaluation is used to narrow the parallelization 
design space concerning the performance impact of distributed vs. ... 

Keywords: Chip multiprocessors (CMP), Myrinet, Scalable Coherent Interface (SCI),
microbenchmarks 



9 A survey of commercial parallel processors
Edward Gehringer, Janne Abullarade, Michael H. Gulyn

September 1988 ACM SIGARCH Computer Architecture News, volume 16 issue 4
Publisher: ACM Press

Full text available: pdf (2.96 MB) Additional Information: full citation, abstract, citings, index terms

This paper compares eight commercial parallel processors along several dimensions. The 
processors include four shared-bus multiprocessors (the Encore Multimax, the Sequent 
Balance system, the Alliant FX series, and the ELXSI System 6400) and four network 
multiprocessors (the BBN Butterfly, the NCUBE, the Intel iPSC/2, and the FPS T Series). 
The paper contrasts the computers from the standpoint of interconnection structures, 
memory configurations, and interprocessor communication. Also, the share ... 

10 Dynamic node reconfiguration in a parallel-distributed environment
Michael J. Feeley, Brian N. Bershad, Jeffrey S. Chase, Henry M. Levy

April 1991 ACM SIGPLAN Notices, Proceedings of the third ACM SIGPLAN symposium
on Principles and practice of parallel programming PPOPP '91, volume 26
Issue 7
Publisher: ACM Press

11 Memory coherence activity prediction in commercial workloads
Stephen Somogyi, Thomas F. Wenisch, Nikolaos Hardavellas, Jangwoo Kim, Anastassia
Ailamaki, Babak Falsafi

June 2004 Proceedings of the 3rd workshop on Memory performance issues: in 
conjunction with the 31st international symposium on computer 
architecture WMPI '04 
Publisher: ACM Press 

Full text available: pdf (434.65 KB) Additional Information: full citation, abstract, references, index terms

Recent research indicates that prediction-based coherence optimizations offer substantial 
performance improvements for scientific applications in distributed shared memory 
multiprocessors. Important commercial applications also show sensitivity to coherence 
latency, which will become more acute in the future as technology scales. Therefore it is 
important to investigate prediction of memory coherence activity in the context of 
commercial workloads. This paper studies a trace-based Downgrade Predi ...

Keywords: coherence misses, coherence prediction, commercial workloads, sharing 
patterns, trace-based prediction 




12 Cache inclusion and processor sampling in multiprocessor simulations
Jacqueline Chame, Michel Dubois 

June 1993 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the
1993 ACM SIGMETRICS conference on Measurement and modeling of
computer systems SIGMETRICS '93, volume 21 issue 1
Publisher: ACM Press

Full text available: pdf (1.17 MB) Additional Information: full citation, abstract, references, citings, index terms, review

The evaluation of cache-based systems demands careful simulations of entire benchmarks. 
Simulation efficiency is essential to realistic evaluations. For systems with large caches 
and large number of processors, simulation is often too slow to be practical. In particular, 
the optimized design of a cache for a multiprocessor is very complex with current 
techniques. This paper addresses these problems. First we introduce necessary and 
sufficient conditions for cache inclusion in systems with invalid ... 

13 Real-time shading
Marc Olano, Kurt Akeley, John C. Hart, Wolfgang Heidrich, Michael McCool, Jason L Mitchell,
Randi Rost

August 2004 Proceedings of the conference on SIGGRAPH 2004 course notes
SIGGRAPH '04
Publisher: ACM Press

Full text available: pdf (7.39 MB) Additional Information: full citation, abstract

Real-time procedural shading was once seen as a distant dream. When the first version of 
this course was offered four years ago, real-time shading was possible, but only with one- 
of-a-kind hardware or by combining the effects of tens to hundreds of rendering passes. 
Today, almost every new computer comes with graphics hardware capable of interactively 
executing shaders of thousands to tens of thousands of instructions. This course has been 
redesigned to address today's real-time shading capabili ... 

14 Parallel execution of prolog programs: a survey
Gopal Gupta, Enrico Pontelli, Khayri A.M. Ali, Mats Carlsson, Manuel V. Hermenegildo

July 2001 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 23 Issue 4
Publisher: ACM Press

Full text available: pdf (1.95 MB) Additional Information: full citation, abstract, references, citings, index terms

Since the early days of logic programming, researchers in the field realized the potential 
for exploitation of parallelism present in the execution of logic programs. Their high-level 
nature, the presence of nondeterminism, and their referential transparency, among other 
characteristics, make logic programs interesting candidates for obtaining speedups 
through parallel execution. At the same time, the fact that the typical applications of logic 
programming frequently involve irregular computatio ... 

Keywords: Automatic parallelization, constraint programming, logic programming, 
parallelism, prolog 



15 Formal verification in hardware design: a survey
Christoph Kern, Mark R. Greenstreet

April 1999 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 4 Issue 2
Publisher: ACM Press

Full text available: pdf (411.53 KB) Additional Information: full citation, abstract, references, citings, index terms

In recent years, formal methods have emerged as an alternative approach to ensuring the 
quality and correctness of hardware designs, overcoming some of the limitations of 
traditional validation techniques such as simulation and testing. There are two main 
aspects to the application of formal methods in a design process: the formal framework 
used to specify desired properties of a design and the verification techniques and tools 
used to reason about the relationship between a spec ... 

Keywords: case studies, formal methods, formal verification, hardware verification, 
language containment, model checking, survey, theorem proving 



16 The Mercury Interconnect Architecture: a cost-effective infrastructure for
high-performance servers
Wolf-Dietrich Weber, Stephen Gold, Pat Helland, Takeshi Shimizu, Thomas Wicki, Winfried
Wilcke

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th
annual international symposium on Computer architecture ISCA '97, volume
25 Issue 2
Publisher: ACM Press

Full text available: pdf (1.53 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper presents HAL's Mercury Interconnect Architecture, an interconnect 
infrastructure designed to link commodity microprocessors, memory, and I/O components 
into high-performance multiprocessing servers. Both shared-memory and message- 
passing systems, as well as hybrid systems are supported by the interconnect. The key 
attributes of the Mercury Interconnect Architecture are: low latency, high bandwidth, a 
modular and flexible design, reliability/availability/serviceability (RAS) features, ... 

17 Visual navigation of large environments using textured clusters
Paulo W. C. Maciel, Peter Shirley

April 1995 Proceedings of the 1995 symposium on Interactive 3D graphics
Publisher: ACM Press

Full text available: pdf (2.93 MB) Additional Information: full citation, abstract, references, citings, index terms

A visual navigation system is described which uses texture mapped primitives to represent 
clusters of objects to maintain high and approximately constant frame rates. In cases 
where there are more unoccluded primitives inside the viewing frustum than can be drawn 
in real-time on the workstation, this system ensures that each visible object, or a cluster 
that includes it, is drawn in each frame. The system supports the use of traditional "level- 
of-detail" representations for indi ... 

18 Memory access buffering in multiprocessors
M. Dubois, C. Scheurich, F. Briggs 

June 1986 ACM SIGARCH Computer Architecture News, Proceedings of the 13th
annual international symposium on Computer architecture ISCA '86, volume
14 Issue 2
Publisher: IEEE Computer Society Press, ACM Press

Full text available: pdf (943.66 KB) Additional Information: full citation, abstract, references, citings, index terms

In highly-pipelined machines, instructions and data are prefetched and buffered in both 
the processor and the cache. This is done to reduce the average memory access latency 
and to take advantage of memory interleaving. Lock-up free caches are designed to avoid 
processor blocking on a cache miss. Write buffers are often included in a pipelined 
machine to avoid processor waiting on writes. In a shared memory multiprocessor, there 
are more advantages in buffering memory requests, since each m ... 

19 Techniques for reducing consistency-related communication in distributed
shared-memory systems
John B. Carter, John K. Bennett, Willy Zwaenepoel

August 1995 ACM Transactions on Computer Systems (TOCS), Volume 13 issue 3
Publisher: ACM Press

Full text available: pdf (2.86 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Distributed shared memory (DSM) is an abstraction of shared memory on a distributed- 
memory machine. Hardware DSM systems support this abstraction at the architecture 
level; software DSM systems support the abstraction within the runtime system. One of 
the key problems in building an efficient software DSM system is to reduce the amount of 
communication needed to keep the distributed memories consistent. In this article we 
present four techniques for doing so: software release consistency; m ... 

Keywords: cache consistency protocols, distributed shared memory, memory models, 
release consistency, virtual shared memory 

20 Is data distribution necessary in OpenMP?
Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos,
Jesus Labarta, Eduard Ayguadé

November 2000 Proceedings of the 2000 ACM/IEEE conference on Supercomputing
(CDROM)
Publisher: IEEE Computer Society

Full text available: pdf (116.52 KB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

This paper investigates the performance implications of data placement in OpenMP 
programs running on modern ccNUMA multiprocessors. Data locality and minimization of 
the rate of remote memory accesses are critical for sustaining high performance on these 
systems. We show that due to the low remote-to-local memory access latency ratio of 
state-of-the-art ccNUMA architectures, reasonably balanced page placement schemes, such
as round-robin or random distribution of pages, incur modest performa ...